7 Applied example
In this section, I will illustrate the use of plotscaper, showcasing the workflow and the various features of the package.
The Institute of Health Information and Statistics of the Czech Republic (IHIS, ÚZIS in Czech) is a government agency established by the Czech Ministry of Health. Its primary roles is to collect, process, and report on medical data within the country of Czechia (ÚZIS 2024). Of interest, the institute provides high-quality, open-access medical data, including information about the use and manufacture of medicines, summaries of fiscal and employment records in medical facilities, and various epidemiological datasets.
7.0.1 The data set
The data set (Soukupová et al. 2023) contains longitudinal information about psychiatric care in Czechia. More specifically, it contains aggregated data on individuals released from long-term psychiatric care facilities between 2010 and 2022. It includes information such the region of the treatment facility, the sex of the patients, age category, diagnosis based on the international ICD-10 classification (World Health Organization 2024a, 2024b), the number of hospitalizations, and the total number of days spent in care by the given subset of patients.
Here’s the data set at a quick glance:
## Rows: 68,115
## Columns: 12
## $ year <int> 2019, 2016, 2011, 2013, 2019, 2013, 2018, 2017,…
## $ region_code <chr> "CZ071", "CZ064", "CZ080", "CZ072", "CZ080", "C…
## $ region <chr> "Olomoucký kraj", "Jihomoravský kraj", "Moravsk…
## $ sex <chr> "female", "male", "male", "female", "male", "fe…
## $ diagnosis <chr> "f10", "f2", "f7", "f4 without f42", "f60–f61",…
## $ reason_for_termination <chr> "release", "early termination", "early terminat…
## $ age_category <chr> "40–49", "30–39", "30–39", "40–49", "50–59", "4…
## $ stay_category <chr> "short-term", "medium-term", "short-term", "sho…
## $ field <chr> "psychiatry", "psychiatry", "psychiatry", "psyc…
## $ care_category <chr> "adult", "adult", "adult", "adult", "adult", "a…
## $ cases <int> 13, 3, 2, 3, 1, 2, 2, 32, 2, 2, 1, 12, 1, 1, 2,…
## $ days <int> 196, 345, 38, 108, 1, 47, 319, 3813, 120, 256, …
The data set contains over 68,000 rows, totalling over 410,000 hospitalizations. Each row records the number patients with a particular set of of characteristics released from a treatment facility during a given year, and the number of days the patients spent in treatment in total.
The original dataset used Czech column names and category labels. To make the analysis more easily accessible to non-Czech speakers, I took the liberty of translating most of these to English (excluding the region variable). The translation script is available in the thesis repository, at the following path: ./data/longterm_care_translate.R. Additionally, the data set website contains a JSON schema with a text description of each of the variables (Soukupová et al. 2023). I took the liberty of translating these descriptions as well, and provide them below, in table Table 7.1:
| Translated name | Original name | Description |
|---|---|---|
| year | rok | The year hospitalization was terminated |
| region_code | kraj_kod | Region code based on the NUTS 3 classification |
| region | kraj_nazev | Region where the facility was located |
| sex | pohlavi | Classification of patients’ sex |
| diagnosis | zakladni_diagnoza | The primary diagnosis of the psychiatric disorder based on the ICD-10 classification |
| reason_for_termination | ukonceni | The reason for termination of care |
| age_category | vekova_kategorie | Classification of patients’ age category |
| stay_category | kategorie_delky_hospitalizace | Classification of hospitalization based on length: short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) |
| field | obor | The field of psychiatric care |
| care_category | kategorie_pece | Classification of care: child or adult |
| cases | pocet_hospitalizaci | The total number of cases/hospitalizations in the given subgroup of patients |
| days | delka_hospitlizace | The total time spent in care, in days (= sum of the number of days all patients in a given subgroup spent in care) |
7.0.2 Interactive exploration
Now it is time to start exploring the data using plotscaper.
The first to note about the data set is that the data has been aggregated, such that each row represents the combined number of releases within a given subset of patients. For example, the first row records that, in the year 2019, there were 13 females aged 40-49 who were released from treatment facilities in Olomoucký kraj, after having been in short-term care with F10 ICD-10 diagnosis (mental and behavioural disorders due to use of alcohol, World Health Organization 2024a) for a sum total of 196 days:
## year region_code region sex diagnosis reason_for_termination
## 1 2019 CZ071 Olomoucký kraj female f10 release
## age_category stay_category field care_category cases days
## 1 40–49 short-term psychiatry adult 13 196
Thus, the two primary continuous variables in the data are cases (the number of patients in a given subgroup released from care) and days (the number of days the given subgroup of patients spent in care). Intuitively, we should expect a fairly linear relationship between these variables, such that a larger group of patients should spend a greater number of days in care, in total. We can use plotscaper to visualize this relationship via a scatterplot:
library(plotscaper) # Load in the package
create_schema(df) |> # Create a schema that can be modified declaratively
add_scatterplot(c("cases", "days")) |>
render() # Render the figureInterestingly, there seems to be a leaf-shaped pattern in the data, with three distinct “leaflets”, each suggesting a linear relationship. If we look at the data, we can see that the stay_category variable has three levels, corresponding to short-term (< 3 months), medium-term (3-6 months), and long-term (6+ months) care. If we mark the cases corresponding to the three categories in different colors, we can see that these indeed correspond to the three leaflets:
df |>
create_schema() |>
add_scatterplot(c("cases", "days")) |>
add_barplot(c("stay_category", "cases")) |>
assign_cases(which(df$stay_category == "short-term"), 1) |>
assign_cases(which(df$stay_category == "long-term"), 2) |>
render()However, this does not really explain the absence of points between the different leaflets - if the distribution of cases and days within each of the three stay_category levels were uniform, we should expect to see more points in the gaps between the leaflets. This suggests that there may be a kind of a selection process at play, where patients are less likely to be released after at intervals which are close to the category boundaries. We can see this more easily if we plot the average number of days a group of patients spent in care:
df$avg_days <- df$days / df$cases
df |>
create_schema() |>
add_scatterplot(c("cases", "avg_days")) |>
add_barplot(c("stay_category", "cases")) |>
assign_cases(which(df$stay_category == "short-term"), 1) |>
assign_cases(which(df$stay_category == "long-term"), 2) |>
set_scale("scatterplot1", "y", transformation = "log10", default = TRUE) |>
render()Now we can see the gaps between the three different distributions along the y-axis. By pressing the Q key and hovering over the points near these gaps, we can see that there are only few patient groups where the average patients spends between 60-90 or 150-210 days in care (corresponding to gaps at 2-3 months and 5-7 months). This suggests that, perhaps, if a patient spends 2 months in care, the healthcare providers are more likely to keep them in care longer by transfer them to medium-term care, and likewise, if a patients spends 5 months in care, they are more likely to be transferred to long-term care (or, alternatively, the patients may be released early).